Conversation

@wietzesuijker (Contributor)

Problem

The item at https://api.explorer.eopf.copernicus.eu/stac/collections/sentinel-2-l2a-dp-test/items/S2A_MSIL2A_20251023T105131_N0511_R051_T31UET_20251023T122522 doesn't preview data properly.

Root cause: The workflow template was not passing override parameters to convert.py, so conversions were using incorrect defaults instead of the collection-specific configs.

Changes

  • Restore passing of override_groups, override_spatial_chunk, override_tile_width, and override_enable_sharding to convert.py
  • Fix the --enable-sharding arg to accept string values instead of acting as a boolean flag (allows an empty-string fallback)
  • When overrides are empty strings, convert.py falls back to the collection defaults in CONFIGS (sketched below)
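A minimal sketch of that fallback, assuming a CONFIGS mapping keyed by collection and string-typed override arguments; the values below are illustrative, not the production config in convert.py:

```python
# Illustrative only: empty-string overrides fall back to per-collection defaults.
CONFIGS = {
    "sentinel-2-l2a": {
        "groups": "/measurements",   # assumed value, not the real config
        "spatial_chunk": 4096,
        "tile_width": 512,
        "enable_sharding": True,
    },
}

def resolve_params(collection: str, override_groups: str, override_spatial_chunk: str,
                   override_tile_width: str, override_enable_sharding: str) -> dict:
    """Use an override only when it is a non-empty string."""
    defaults = CONFIGS[collection]
    return {
        "groups": override_groups or defaults["groups"],
        "spatial_chunk": int(override_spatial_chunk) if override_spatial_chunk else defaults["spatial_chunk"],
        "tile_width": int(override_tile_width) if override_tile_width else defaults["tile_width"],
        # --enable-sharding is now a string ("true"/"false"/""), so "" can mean "use the default"
        "enable_sharding": override_enable_sharding.lower() == "true"
        if override_enable_sharding else defaults["enable_sharding"],
    }
```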

Testing

Will test with a workflow run once the Docker image builds.

Add complete Argo Workflows infrastructure for the geozarr pipeline with automated AMQP event triggering.

Workflow pipeline:
- Convert: Sentinel-2 Zarr → GeoZarr (cloud-optimized)
- Register: Create STAC item with metadata
- Augment: Add visualization links (XYZ tiles, TileJSON)

Event-driven automation:
- AMQP EventSource subscribes to RabbitMQ queue
- Sensor triggers workflows on incoming messages
- RBAC configuration for secure execution

Configuration:
- Python dependencies (pyproject.toml, uv.lock)
- Pre-commit hooks (ruff, mypy, yaml validation)
- TTL cleanup (24h auto-delete completed workflows)

Add STAC registration, augmentation, and workflow submission scripts.

- register_stac.py: Create/update STAC items with S3→HTTPS rewriting
- augment_stac_item.py: Add visualization links (XYZ tiles, TileJSON)
- submit_via_api.py: Submit workflows via Argo API for testing
- Retry with exponential backoff on transient failures
- Configurable timeouts via HTTP_TIMEOUT, RETRY_ATTEMPTS, RETRY_MAX_WAIT
- Workflow step timeouts: 1h convert, 5min register/augment
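The retry/timeout behavior described above could look roughly like this; the environment variable names come from the list, while the use of tenacity and requests is an assumption for illustration:

```python
import os

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

HTTP_TIMEOUT = float(os.getenv("HTTP_TIMEOUT", "30"))
RETRY_ATTEMPTS = int(os.getenv("RETRY_ATTEMPTS", "3"))
RETRY_MAX_WAIT = float(os.getenv("RETRY_MAX_WAIT", "60"))

@retry(stop=stop_after_attempt(RETRY_ATTEMPTS),
       wait=wait_exponential(multiplier=1, max=RETRY_MAX_WAIT))
def put_stac_item(stac_api: str, collection: str, item: dict) -> None:
    """PUT a STAC item, retrying transient failures with exponential backoff."""
    url = f"{stac_api}/collections/{collection}/items/{item['id']}"
    resp = requests.put(url, json=item, timeout=HTTP_TIMEOUT)
    resp.raise_for_status()  # non-2xx raises, which re-triggers the retry decorator
```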
Add operator notebooks and environment configuration.

- Submit workflow examples (AMQP and direct API)
- Environment variable template (.env.example)
- .gitignore for Python, IDEs, Kubernetes configs

Add container build configuration and development tooling.

- Dockerfile for data-pipeline image
- Makefile for common tasks (build, push, test)
- GitHub Container Registry integration

Add comprehensive testing and project documentation.

- Unit tests for register_stac and augment_stac_item
- Integration tests for workflow submission
- E2E test configuration
- Project README, CONTRIBUTING, QUICKSTART guides
- CI workflow (GitHub Actions)

Extend pipeline to support Sentinel-1 GRD collections:

- S1 GRD workflow configuration and test payloads
- Collection detection logic (get_crs.py extended for S1)
- Staging namespace deployment (rbac-staging.yaml)
- S1-specific STAC registration handling
- End-to-end S1 test suite
- v20-v22 image iterations with S1 support

Enables multi-mission pipeline supporting both S2 L2A and S1 GRD products.

Add comprehensive S1 GRD pipeline documentation and example code.

docs/s1-guide.md:
- S2 vs S1 feature comparison (groups, flags, chunks, polarizations)
- Collection registry config for sentinel-1-l1-grd
- Preview generation logic (grayscale with polarization detection)
- Test data sources (EODC STAC)
- Workflow parameters for S1 conversion
- Known issues (GCP reprojection, memory, TiTiler rescaling)
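The polarization-detection step behind the grayscale preview, sketched under assumptions (the function name and preference order are illustrative, not the documented implementation):

```python
def pick_preview_polarization(available: list[str]) -> str:
    """Choose a single band for a grayscale preview: prefer VV, fall back to VH."""
    for pol in ("VV", "VH"):
        if pol in available:
            return pol
    raise ValueError(f"No VV/VH polarization found in {available!r}")

# e.g. pick_preview_polarization(["VH", "VV"]) -> "VV"
```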

examples/s1_quickstart.py:
- End-to-end S1 pipeline: fetch → convert → register → augment
- Demonstrates S1-specific flags: --gcp-group, --spatial-chunk 2048
- Example using EODC S1C_IW_GRDH test item
- Local development workflow

Usage:
  python examples/s1_quickstart.py

Generalize pipeline through collection registry pattern:

- Collection-specific parameter registry (groups, chunks, tile sizes)
- Dynamic parameter lookup script (get_conversion_params.py)
- Registry integration across all workflow stages
- Support for S2 L2A and S1 GRD with distinct parameters
- Kustomize-based deployment structure

Enables scalable addition of new missions (S3, S5P, etc.) through
registry configuration without code changes.
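In outline, the registry maps a mission prefix to its conversion parameters; the structure below is a sketch with illustrative values, not the contents of get_conversion_params.py:

```python
# Sketch: per-mission defaults resolved by collection-id prefix.
REGISTRY = {
    "sentinel-1": {"groups": "/measurements", "spatial_chunk": 4096, "tile_width": 512},
    "sentinel-2": {"groups": "/measurements", "spatial_chunk": 4096, "tile_width": 512},
}

def get_conversion_params(collection_id: str) -> dict:
    """Resolve parameters by prefix, e.g. 'sentinel-2-l2a' -> the 'sentinel-2' entry."""
    for prefix, params in REGISTRY.items():
        if collection_id.startswith(prefix):
            return params
    raise ValueError(f"No registry entry for collection {collection_id!r}")
```

Adding a new mission then means adding a registry entry rather than touching workflow code.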
Add comprehensive performance measurement and validation:

- Automated validation workflow task (validate_geozarr.py)
- Performance benchmarking tools (benchmark_comparison.py, benchmark_tile_performance.py)
- Production metrics from 9 operational workflows (8.6 min avg, 75% success)
- Ecosystem compatibility validation (zarr-python, xarray, stac-geoparquet)
- User guide for adding new collections (docs/ADDING_COLLECTIONS.md)
- Performance report with operational metrics (docs/PERFORMANCE_REPORT.md)

Production validation shows pipeline ready for deployment with
validated performance and ecosystem compatibility.

Enable parallel chunk processing with Dask distributed:

- Add --dask-cluster flag to conversion workflow
- Update to v26 image with Dask support
- Add validation task between convert and register stages

Initial test shows 1.6× speedup (320s vs 516s baseline).
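As a rough illustration of what --dask-cluster enables, the conversion step can attach a dask.distributed client so chunk writes run in parallel; worker counts and memory limits below are placeholders:

```python
from dask.distributed import Client, LocalCluster

def maybe_start_dask(enabled: bool) -> Client | None:
    """Start a local distributed cluster when --dask-cluster is set."""
    if not enabled:
        return None
    cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="2GB")
    return Client(cluster)  # subsequent xarray/zarr operations run across workers
```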
The task was defined but never referenced in the DAG (lines 25-37).

Add workflow parameters:
- stac_api_url, raster_api_url (API endpoints)
- s3_endpoint, s3_output_bucket, s3_output_prefix (S3 config)

Replace all hardcoded values with parameter references for:
- STAC/raster API URLs in register/augment tasks
- S3 endpoint in all tasks
- S3 bucket/prefix in convert/validate/register tasks

Enables easy environment switching (dev/staging/prod) via parameter override.

Three Jupyter notebooks demonstrating GeoZarr data access and pyramid features:

01_quickstart.ipynb
- Load GeoZarr from S3 with embedded STAC metadata
- Visualize RGB composites
- Inspect geospatial properties

02_pyramid_performance.ipynb
- Benchmark tile serving with/without pyramids
- Measure the observed 3-5× speedup at zoom levels 6-10
- Calculate storage tradeoffs (33% overhead)

03_multi_resolution.ipynb
- Access individual pyramid levels (0-3)
- Compare sizes (4.7 MB → 72 KB)
- Explore quality vs size tradeoffs

These notebooks help users understand the pipeline outputs and evaluate
pyramid benefits for their use cases. Still evolving as we refine the
conversion process and gather production feedback.
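The access pattern the notebooks demonstrate is roughly the following; the bucket, endpoint, and group path are placeholders rather than the production layout:

```python
import xarray as xr

# Level 0 is full resolution; higher-numbered groups are coarser pyramid levels.
store = "s3://example-bucket/sentinel-2-l2a/EXAMPLE_ITEM.zarr/r10m/0"
ds = xr.open_zarr(
    store,
    consolidated=True,
    storage_options={"anon": True, "client_kwargs": {"endpoint_url": "https://s3.example.com"}},
)
print(ds.data_vars)  # inspect available bands, e.g. the TCI composite
```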
Replace inline bash script in workflows/amqp-publish-once.yaml with
scripts/publish_amqp.py. Script is now included in Docker image,
eliminating need for runtime pip installs and curl downloads.

Changes:
- Add scripts/publish_amqp.py with routing key templates and retry
- Update workflows/amqp-publish-once.yaml to use pre-built image
- Add workflows/ directory to docker/Dockerfile
- Add tests/unit/test_publish_amqp.py with pytest-mock
20 tests: pattern matching, S1/S2 configs, CLI output formats
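A condensed sketch of the publisher pattern (the exchange name and routing-key template are illustrative; the real logic lives in scripts/publish_amqp.py):

```python
import json

import pika

def publish(amqp_url: str, payload: dict, routing_key_template: str) -> None:
    """Publish a payload, filling the routing key from payload fields."""
    routing_key = routing_key_template.format(**payload)  # e.g. "eopf.item.found.{collection}"
    conn = pika.BlockingConnection(pika.URLParameters(amqp_url))
    try:
        channel = conn.channel()
        channel.basic_publish(exchange="geozarr", routing_key=routing_key,
                              body=json.dumps(payload).encode())
    finally:
        conn.close()
```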
Tests asset priority logic (product > zarr > any .zarr) and error handling
for missing or malformed STAC items.
Tests subprocess execution, timeout handling, error cases, and CLI
options including file output and verbose mode.
Measures load time and dataset metrics for performance comparison.
Outputs JSON results with speedup factor and format recommendations.

- Add show-parameters step displaying full workflow config in UI
- Add step headers (1/4, 2/4, etc) to all pipeline stages
- Add progress indicators and section dividers for better readability
- Add workflow metadata labels (collection, item-id) for filtering
- Fix sensor event binding (rabbitmq-geozarr/geozarr-events)
- Add S1 E2E test job (amqp-publish-s1-e2e.yaml)

Argo UI now shows:
  • Full payload/parameters in dedicated initial step
  • Clear step numbers and progress for each stage
  • Final URLs for STAC item and S3 output
  • Better context during long-running conversions

Complete validation report showing:
- Successful S1 GRD to GeoZarr conversion
- 21-minute workflow execution (30k × 15k pixels)
- 6-level multiscale pyramids for VV/VH polarizations
- STAC registration with preview links
- UI enhancements validated in Argo
- Collection registry parameters documented
- Fix sys.path in test_publish_amqp.py from parent.parent to parent.parent.parent
- Update S1 spatial_chunk test expectations from 2048 to 4096
- Aligns with code changes in get_conversion_params.py
- Remove test_real_stac_api_connection (only checked HTTP 200, no logic)
- Remove unused os import
- Test had external dependency, was flaky, redundant with mocked tests
- Format long argparse description lines for readability
- No functional changes, purely formatting

- Set archiveLogs: false for immediate log visibility via kubectl
- Change convert-geozarr from script to container template for stdout logs
- Reduce memory request to 6Gi (limit 10Gi) for better cluster scheduling
- Add Dask parallel processing info in comments
- Simplify show-parameters to basic output

Fixes 30-60s log delay in Argo UI. Logs now visible via kubectl immediately.

- Add run-s1-test.yaml for direct kubectl submission
- Update amqp-publish-s1-e2e.yaml with optimized test parameters
- Use S1A item from Oct 3 for consistent testing
- Add WORKFLOW_SUBMISSION_TESTING.md with complete test results
- Update README.md: reorganize by recommendation priority
- Document all 4 submission methods with pros/cons
- Add troubleshooting for log visibility and resource limits
- Simplify Quick Start to 2 commands (30 seconds)
- Document Dask integration and resource optimization

Covers kubectl, Jupyter, event-driven (AMQP), and Python CLI approaches.

Test validation is proven by 93 passing tests, not narrative docs.
- Configure pytest pythonpath to enable script imports (unblocks 90 tests)
- Add exception tracebacks to get_conversion_params error handlers
- Add error trap to validate-setup.sh for line-level diagnostics
- Replace timestamp-based Docker cache with commit SHA for precision
- Add pre-commit hooks (ruff, mypy) for code quality enforcement

Test results: 90/90 passing, 32% coverage

- Add integration-tests job in GitHub Actions (runs on PRs only)
- Add explicit resource requests/limits to all workflow templates
  - convert-geozarr: 6Gi/10Gi memory, 2/4 CPU
  - validate: 2Gi/4Gi memory, 1/2 CPU
  - register-stac: 1Gi/2Gi memory, 500m/1 CPU
  - augment-stac: 1Gi/2Gi memory, 500m/1 CPU

Prevents pod eviction and enables predictable scheduling

wietzesuijker and others added 24 commits (October 22, 2025 18:31)

- Fix all failing unit tests for refactored code
- Add comprehensive tests for create_geozarr_item.py
- Add test coverage for metrics module
- Update test fixtures for new script structure
- Achieve 99% coverage with clean output
- Move test utilities to tools/testing/
- Enable auto-build on all branches for rapid iteration
- Add validation dependencies to pyproject.toml
- Update uv.lock with latest dependencies
- Restructure README with clear Quick Start
- Add inline kubectl YAML examples
- Organize Usage into 3 methods (kubectl, AMQP, Jupyter)
- Add Workflow Steps section explaining pipeline
- Improve Configuration with subsections
- Enhance Troubleshooting with actionable commands
- Update CONTRIBUTING.md and GETTING_STARTED.md
- Update Makefile and examples/
- Remove validation step from pipeline (convert → register)
- Delete validate_geozarr.py script
- Remove tools/, examples/, tests/, docs/ directories
- Remove duplicate workflow YAMLs (rbac, sensor, eventsource at root)
- Consolidate markdown files to 3 (README, workflows/README, notebooks/README)
- Reduce to 6 core scripts (create, register, augment, params, utils, metrics)
- Update Makefile (remove test/test-cov/publish/deploy targets)
- Update pyproject.toml description for minimal pipeline
- Update workflows/base/workflowtemplate.yaml to 2-step DAG
- Update documentation for engineers familiar with Argo/K8s/STAC
- Create convert.py and register.py entry points
- Chain function calls in Python instead of bash
- Eliminate shell variable passing and multiple process spawns
- Preserve individual script CLI interfaces for standalone use
- Cleaner error handling with Python exceptions vs bash exit codes

The slim branch has no tests directory, so integration tests should be skipped like the main test job.

- Remove matrix strategy (not publishing a package, single Python version is sufficient)
- Remove integration-tests job (no integration tests exist)
- Remove hashFiles conditions (unnecessary complexity)
- Use Python 3.11 consistently (matches Docker image)
- Update requires-python to >=3.13
- Update Dockerfile base image to python:3.13-slim
- Update ruff target-version to py313
- Remove 3.11/3.12 classifiers, keep only 3.13
- Remove metrics.py
- Remove prometheus-client dependency
- Remove --enable-metrics flag from register.py
- Remove metrics imports and calls from register_stac.py
- Remove metrics import/usage from augment_stac_item.py
- Remove --enable-metrics from workflow template

Prometheus metrics can be added with a separate PR using the feat/prometheus-metrics-integration branch.
Keep slim branch focused on core pipeline (7 scripts, 1 job).
Notebooks can be re-added on separate feature branch with:
- Proper dependencies in pyproject.toml
- Plug-and-play setup (uv sync --extra notebooks)

- Remove unused environment variable override system
- Remove pattern matching complexity (only 2 missions)
- Simplify to direct prefix lookup (sentinel-1, sentinel-2)
- 157 → 100 lines (36% reduction)
- Same functionality, clearer code

Formats still work:
  --format json   (JSON output)
  --format shell  (shell variables)
  --param groups  (single param)

Register step now passes --s3-output-bucket and --s3-output-prefix
instead of pre-constructed --geozarr-url. Construction happens in
register.py using item_id extracted from source_url.

Workflow YAML: 130 → 111 lines (no inline Python)
register.py: bucket/prefix args, constructs s3://{bucket}/{prefix}/{collection}/{item_id}.zarr
- extract_item_id: replaced with urlparse().path.split("/")[-1]
- get_zarr_url: moved into convert.py
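The construction described above amounts to roughly this (the helper name is assumed; the bucket/prefix/collection layout comes from the commit message):

```python
from urllib.parse import urlparse

def derive_geozarr_url(source_url: str, bucket: str, prefix: str, collection: str) -> str:
    """Build the output store URL from the source item URL and S3 settings."""
    item_id = urlparse(source_url).path.rstrip("/").split("/")[-1]  # last path segment
    return f"s3://{bucket}/{prefix}/{collection}/{item_id}.zarr"
```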
Remove workflow_dispatch and tags triggers from build workflow
Remove pull_request and workflow_dispatch triggers from test workflow
Fix permissions in test workflow (no write access needed for tests)
… artifacts

- Add S3 cleanup before conversion to remove stale base arrays
- Revert to Python entry points (convert.py, register.py) for maintainability
- Fix groups parameter type (string → list) for API compatibility
- Use clean args approach instead of inline bash scripts
- Fix TiTiler preview path to use overview arrays (/r10m/0:tci)

This addresses PR feedback by consolidating the cleanup fix with proper
Python-based workflow structure. All debugging iterations squashed.
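One way the pre-conversion cleanup could be implemented, assuming s3fs; endpoint and credential handling are simplified, so treat this as a sketch rather than the shipped code:

```python
import s3fs

def cleanup_stale_output(bucket: str, prefix: str, collection: str, item_id: str,
                         endpoint_url: str) -> None:
    """Remove any stale GeoZarr output before re-running the conversion."""
    fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": endpoint_url})
    target = f"{bucket}/{prefix}/{collection}/{item_id}.zarr"
    if fs.exists(target):
        fs.rm(target, recursive=True)  # stale base arrays would otherwise shadow new writes
```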
The --crs-groups flag triggers prepare_dataset_with_crs_info() in data-model,
which writes CRS metadata via ds.rio.write_crs() and creates the spatial_ref
coordinate variable required by TiTiler validation.

Restores working configuration from commit 21ea009.
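In minimal form, the CRS step described above looks like the following (a sketch assuming rioxarray; prepare_dataset_with_crs_info() itself lives in data-model):

```python
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)
import xarray as xr

def write_crs_metadata(ds: xr.Dataset, epsg: int) -> xr.Dataset:
    # write_crs() attaches the CRS and creates the spatial_ref coordinate
    # variable that TiTiler's validation expects.
    return ds.rio.write_crs(f"EPSG:{epsg}")
```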
- workflows/README: explain secret purposes (event ingestion, storage, API auth)
- workflows/README: add direct OVH Manager links for kubeconfig and S3 credentials
- README: delegate setup to workflows/README
- Separate operator usage (root README) from deployment setup (workflows/README)